Introduction to Deep Reinforcement Learning (DRL)

Deep reinforcement learning (DRL) combines the high-dimensional representational power of deep neural networks with the optimal-control framework of reinforcement learning. Unlike supervised or unsupervised learning, a DRL agent interacts with a dynamic environment through trial and error and makes sequential decisions without immediate, explicit labels. This integration allows agents to handle complex, raw inputs (such as pixel data) directly.

1. The DRL Learning Paradigm

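A reinforcement learning agent operates in a continuous loop: it observes the environment's state ($S_t$), takes an action ($A_t$), and receives a scalar reward ($R_{t+1}$) that may be sparse or delayed. The central challenge is the credit assignment problem: determining which past actions contributed to a later reward signal.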
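The observe-act-reward loop can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented for this example, and the environment is a toy corridor whose only reward sits at the far end, so the reward is both sparse and delayed.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a 1-D corridor with a reward only at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, any other action moves left; walls clamp the position
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # sparse: zero until the goal is reached
        return self.state, reward, done


def random_policy(state, rng):
    """Placeholder policy (an assumption): ignores the state and acts uniformly at random."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=200):
    """Run one episode and record (S_t, A_t, R_{t+1}) for every step."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state, rng)                   # choose A_t from S_t
        next_state, reward, done = env.step(action)   # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


rng = random.Random(0)
trajectory = run_episode(CorridorEnv(), random_policy, rng)
rewards = [r for (_, _, r) in trajectory]
```

Every reward before the final step is zero, which is what makes credit assignment hard here: the single nonzero signal arrives only after a whole sequence of earlier actions.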

2. The Optimization Objective

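The ultimate goal is to find the optimal strategy, or policy ($\pi^*$): a mapping from states to actions that maximizes the expected cumulative discounted reward ($G_t$). The discount factor ($\gamma \in [0, 1]$) is mathematically important because it determines how heavily the agent weights immediate rewards relative to rewards expected in the distant future. For a task that never terminates, $\gamma < 1$ keeps the infinite sum finite when rewards are bounded.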

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
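For a finite episode, the sum above can be evaluated directly. The sketch below assumes the rewards of one episode are already collected in a list (`rewards[k]` standing for $R_{t+k+1}$). The helper names `discounted_return` and `returns_per_step` are illustrative only. Both rely on the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$, which follows from the definition of $G_t$.

```python
def discounted_return(rewards, gamma):
    """Return G_t for a finite reward list, where rewards[k] plays the role of R_{t+k+1}."""
    g = 0.0
    # Walk backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def returns_per_step(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] for one episode using the same recursion."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


rewards = [0.0, 0.0, 1.0]
g0 = discounted_return(rewards, 0.9)      # approximately 0.9**2 * 1.0 = 0.81
all_g = returns_per_step(rewards, 0.9)    # approximately [0.81, 0.9, 1.0]
```

The reward that arrives two steps late is worth only about 0.81 from the viewpoint of the first step, which is the discounting effect the formula describes.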

The Fundamental RL Cycle

An illustration of the Markov Decision Process (MDP) framework. The Agent's policy dictates the action ($A_t$) based on the current state ($S_t$), leading the Environment to transition to a new state ($S_{t+1}$) and provide a reward ($R_{t+1}$).

The Reinforcement Learning Cycle: Agent, Environment, State, Action, Reward

Question 1

How does the DRL agent receive feedback from the environment?

Explicit labels/targets

Backpropagation through time

Scalar reward signal

Labeled demonstration data

Question 2

What does the policy ($\pi$) mathematically represent?

The predicted total reward

A distribution over actions given a state

The probability of transitioning to a new state

The error between predicted and actual returns

Challenge: The Discount Factor

Analyzing the Temporal Horizon.

Consider two scenarios:
1. $\gamma = 0$
2. $\gamma \approx 1$

Describe the agent's behavioral preference in each case regarding the timeline of rewards.

Step 1

How does the choice of $\gamma$ affect the policy's horizon?

Solution:
If $\gamma = 0$, the agent is myopic (shortsighted): every term after the first vanishes, so $G_t = R_{t+1}$ and only the immediate reward matters. If $\gamma \approx 1$, the agent is far-sighted: it weights distant future rewards almost as heavily as immediate ones, which leads to planning over a very long horizon.
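A small numeric check makes the contrast concrete. The two reward sequences below are made up for illustration: one pays a small reward immediately, the other pays a larger reward only on the tenth step. The helper `discounted_return` is an illustrative name and applies the definition of $G_t$ to a finite list.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * R_{t+k+1} over a finite reward list."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical reward sequences, ten steps each
immediate = [1.0] + [0.0] * 9   # small reward on the first step
delayed = [0.0] * 9 + [10.0]    # larger reward on the tenth step

preferences = {}
for gamma in (0.0, 0.5, 0.99):
    g_immediate = discounted_return(immediate, gamma)
    g_delayed = discounted_return(delayed, gamma)
    # The agent prefers whichever sequence has the larger discounted return
    preferences[gamma] = "immediate" if g_immediate > g_delayed else "delayed"
    print(f"gamma={gamma}: immediate={g_immediate:.3f}, delayed={g_delayed:.3f}")
```

With $\gamma = 0$ or $\gamma = 0.5$ the immediate reward wins, because the delayed reward is scaled by $\gamma^9$. With $\gamma = 0.99$ that factor is still above 0.9, so the delayed sequence has the higher return.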